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Abstract. We describe an efficient FPGA implementation for the expo- 
nentiation of large matrices. The research is related to an algorithm for 
constructing uniformly distributed linear recurring sequences. The design 
utilizes the special properties of both the FPGA and the used matrices to 
achieve a very significant speedup compared to traditional architectures. 

1 Introduction 

Field-programmable gate arrays (FPGA) offer a number of special options in 
computation. Utilizing the unique properties of an FPGA, some algorithms 
that are impractical to implement on a more traditional architecture can be- 
come both convenient to create and resource-efficient. The programmable ar- 
ray of look-up tables commonly found on an FPGA provide both flexibility 
in creating logic to suit specific needs and naturally lend themselves to great 
parallelism in computations. 

Fast operations on matrices are of great practical interest. Ways to speed 
up certain matrix calculations still find their way into numerous applications. 

Faster implementations of matrix algorithms can be achieved either from a 
"software" point of view, by improving upon the algorithm itself, or from a 
"hardware" point of view, by using faster or differently structured architec- 
tures. 
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Theoretical improvements on matrix algorithms include Strassen's algo- 
rithm [12] and the Coppersmith- Winograd algorithm [2]. The naive algorithm 
for matrix multiplication is a well-known 0(n 3 ) algorithm. Strassen's algo- 
rithm uses an idea similar to the Karatsuba-multiplication. It has a time 
complexity of 0(n lg7 ) by dividing the matrices into sub- matrices. Then by 
multiplying them in a different arrangement, it manages an overall lower multi- 
plication count compared to the classical algorithm. Research implementing it 
on the Cell Broadband Engine can be found in [5]. Strassen's algorithm and its 
applicability to the project is briefly discussed in Section 7. The Coppersmith- 
Winograd algorithm further improves the complexity to 0(n 2-376 ) by combin- 
ing the idea of Strassen with the Salem-Spencer theorem. [9] discusses and 
compares the performance of implementations of these algorithms. 

Numerous research has been done on creating efficient realizations of differ- 
ent matrix operations on different architectures. [8] and [10] both use FPGAs 
to perform matrix inversion. 

The design presented here is an implementation of matrix multiplication 
on an FPGA. Works of similar nature can be found in [1] and [4], dealing 
with FPGA configurations used for floating point matrix multiplication. [11] 
uses an FPGA design for digital signal processing. [3] discusses another FPGA 
implementation for accelerating matrix multiplication. 

The research in this paper is related to an algorithm for the construction of 
pseudo random number generators. It requires the exponentiation of large ma- 
trices to an extremely high power. This allows for numerous optimizations to 
be made on the FPGA implementation, resulting in an extremely fast design. 
A speedup factor of ~200 is achieved compared to a highly optimized program 
on a more traditional architecture. 

We give the details of a design implemented on a Virtex-5 XC5VLX110T 
FPGA that multiplies two 896 x 896 sized matrices. The matrices are defined 
over the mod 4 residue class ring. Using this property and the fact that the 
hardware uses 6-LUTs (Lookup Tables), we describe first a module that com- 
putes the dot product of vectors taken from Z| 8 in a single clock cycle at 
100MHz clock speed. With these modules we construct a matrix multiplier 
module that computes the C € Z^ * 20 product matrix of A € Z,^ 0xl8d and 
B e Zj 8dx10 in d clock cycles at 100MHz. The significance of the value 28 
in the implementation and its experimental determination is also discussed. 
Finally, we describe how to use these modules for multiplying matrices taken 
from Z 896 * 896 . The proposed algorithm deals with the management of stored 
data in such a way that it can be accomplished completely in parallel with 
the computations. The resulting design completes the multiplication in 64800 
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clock cycles at 100MHz. 

Future work for increasing the size of the used matrices, and further opti- 
mizing the design's performance using Strassen's algorithm is also described. 

2 Mathematical background 

The present work is initiated by a method for the construction of uniformly 
distributed pseudo random number generators. (See [7].) The generator uses 
recurring sequences modulo powers of 2 of the form 

u n = CLd-TUn-! + a d _2Un_ 2 H 1" CL u n _ d mod 2 s , en € {0, 1 , 2, 3}, s € Z + 

The theoretical background can be found in [6]. 

The construction assumes that the values do, ai , . . . , a d _i are such that 

x d -a d _ix d_1 a = (x- 1) 2 P(x) mod 2 

holds for some P(x) irreducible polynomial. It is practical to choose P(x) to 
have maximal order, since the order of P is closely related to the period length 
of the corresponding recurring sequence. The sequence u n obtained this way 
does not necessarily have uniform distribution, however exactly one of the 
following four sequences does: 
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For the details see [7]. Finding the sequence with uniform distribution is of 
interest. Let 
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be the companion matrix of sequence u. To find which of the above se- 
quences has a uniform distribution, we have to compute M(u) 2 ~ 2 mod 4. 
If M(u) 2 + ~ 2 mod 4 equals the identity matrix, then the period length of u n 
is 2 d+1 — 2, which means it is not the sequence we are searching for. 

The exponentiation of matrices to high powers can quickly become time 
consuming on traditional computers. The aim of the project was to utilize 
the special properties of an FPGA to achieve a significant upgrade in speed 
compared to implementations on more traditional architectures. 

3 Hardware used in the implementation 

The project was implemented on a Xilinx XUPV505-LX110T development 
platform. The board features a variety of ports for communication with the 
device. As a first approach the RS-232 serial port was used to send data 
between the board and a PC. A high-speed PCI Express connection is also 
available if the amount of data transferred would necessitate its use. 

The board's most prominent feature is the Virtex-5 XC5VLX110T FPGA. 
The FPGA's main tool for computation is the array of 6-input look-up tables, 
arranged into 17280 Slices, with four look-up tables found in each Slice, adding 
up to a total of 69120 LUTs. A single 6-input LUT can store 64 bits of data, 
where its six input bits are used as an address to identify the single bit of data 
that is to be outputted. By manipulating the 64 bit content of the look-up 
table, it can be configured to carry out arbitrary Boolean functions with at 
most six input bits. In our design they are used to create LUTs performing 
a multiply-accumulate function, which are hierarchically arranged into larger 
and more complex modules. One out of four LUTs on the device can also be 
used as a 32 bit deep shift register; these are the basis to implement containers 
storing the data, which is directly fed to the computational module. 

Attached to the board, there is a 256MB DDR2 SODIMM module, which 
is used for storing data exceeding the amount that can be practically stored 
on the FPGA. 

4 Structure of modules used in the computation 

The basic elements of the design are the LUTs denoted by L(a, b, s) = c, where 
a, b,c and s are two-digit binary numbers. The function carried out by L is a 
multiply-accumulate (for short: MA) function, i.e.: 

c = (a • b) + s mod 4 . 
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Let a = 2ai + ao, b = 2(3i + |3o, s = 2o"i + ao, c = 2yi + yo, where 
cxo, <X] , po, pi , (To, £Ti,yo,yi € {0,1}, and L = (li,lo) where li and lo are two 
single bit LUTs, according to the following: 

• k>(ao, Po, Co) =yo 

• bi(a,b,s) = yi 
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Figure 1: The structure of L(a, b,s) 

We remark that while lo needs only three input bits to accomplish its func- 
tion, li requires all six bits of input. 

The LUTs lo and li were configured to the values shown in Table 1 and 
Table 2 to perform the multiply-accumulate function. 
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Table 1: Contents of Iq 



With the help of these basic units one can compute the dot product w of 
two vectors u = (uo, ui , . . . , Un_i ) and v = (vq, Vi , . . . , v n _i ) . Let us define a 
module m = (L[0], L[l ],..., L[n — 1]) by cascading n MA units denoted by L[i]. 
In this module m we use the output of a given MA unit as the sum input of 
the next unit, i.e. Si+i = Ct for i = 0, 1 , . . . , n — 2, where Si and Ci are the s 
input and c output of L[i]. 

Therefor m is a function that accepts a pair of vectors u,v of two-digit 
numbers of length n and outputs on c n _i the two-digit dot-product of the two 
vectors, i.e. m(u,v) = w. 
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Table 2: Contents of I, 
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Figure 2: The structure of m(u,v) 



In total, the number of LUTs used in m is 2n. Note that vectors of arbitrary 
length can be used in the computation if we connect the output of module m 
to the sum input of L[0] (c n _i = so), and then iteratively shift u and v onto 
the module's input by n elements at a time: 
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Function iterated_m(u, v) // k = length(u) = lerigtri(v) 

1. Define k = v',u' € 

2. for 1 = to k ■ n — 1 do // fill v and u with 0's 

3. if i < k then v- = Vi else v[ = 

4. if i < k then u( = Ut else u! = 

5. end for 

6. Define v te mp, u-temp, w, let w = 

7. for i = to k — 1 do // shift v' and u' to v teT ap an d u te mp 

8. Vtemp = ( v i / -n) v l' + (i.n)' • • • > V n-1+(t-n)) 

9. u temp = (u^u^^, . . . ,u n _ 1+fi . n) ) 

10. W = W + m(v temp , Utemp) 

11. end for 

12. return w 
end Function 



Here u' and v' are the extensions of u and v by 0's. 

We shall see that the number chosen for n is critical in setting many char- 
acteristics of the entire project. The experiment used for determining n will 
be discussed in the following chapter. 

Our aim is to obtain a module that performs the matrix multiplication of 
A, B € 1^ xk , where Z4 is the mod 4 residue class ring. In the following, let 
C 6 Z^ xk be the output matrix, such that C = A x B. Furthermore, let at be 
the ith row of matrix A and let bj be the jth column of matrix B. 

The multiplier units denoted by m are used to create more complex mod- 
ules in a hierarchical manner. First, by taking ten m multiplier blocks we 
create a row of multipliers R = (mo, mi , . . . , IU9). This is used to compute ten 
consecutive elements of a single row of the output matrix: 

R(ii,bj,b j+1 ,...,b j+9 ) = (c g ,c i)j+1 ,...,c i)j+9 ) , 

where = • bj . The input vector at is used by all ten multiplier units of R. 
The length of these vectors, as mentioned above, can be arbitrary, but vectors 
of length greater than n will need to be iteratively shifted to the input of R. 

By taking ten row multipliers we can create a unit Mioxio = (Ro> Rl> • • • > R9) 
which outputs a 10 x 10 sub-matrix of C: 
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M 10x io(ai> CLi+i, . . . , cii + 9,bj,bj + i, . . . ,bj + 9 



cy+9 \ 

c i+l,j+9 



c i+1,j c i+1,j+1 
\ c i+9,j Ci+9,j+l • • • Ci+9j+9/ 

Finally, four such units are arranged so that a 20 x 20 sub-matrix of C could 
be obtained as output: 

M-20x2o( a i> CH + i , . . . , Cli+19, bj, bj +1 , . . . , bj +19 ) = 



c i+1,j c i+1,j+l 



c i,j+19 \ 
Ct+1,j+19 



\ c i+19,j Ci+19,j+1 • • • Ci+19^+19/ 

The M.20x2o' s inputs are twenty vectors from both matrices A and B. Because 
of hardware constraints — in particular the number of LUTs on the used device 
— a larger arrangement of multipliers would be impractical to implement. The 
module M.20x20 is comprised of 400 m multiplier units. Figure 3 shows the 
hierarchy of units used to build M2o X 20- 
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Figure 3: The structure of M.20x20 

The M20X20 un it can be used iteratively to multiply matrices of arbitrary 
size, producing 20x20 sub-matrices of the output matrix C with each iteration. 
After inputting twenty rows from matrix A and twenty columns from matrix 
B and obtaining the desired output, we can simply repeat the process for a 
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set of rows and columns of A and B respectively, until we obtain the entire 
output matrix C: 

Function large_matrix_mult(A, B) 

1. Define k= A',B',C e Z* nXKn 

2. for i = to 20k - 1 do 
3. for j = to 20k - 1 do 

4. if i < k and j < k a[- = else a[- = 

5. if i < k and j < k \)[- = by else b~ = 

6. end for end for 

7. for i = to k — 1 do 
8. for j = to k — 1 do 

9 - C '[i+19,j+19] = M 20x20(li, di+i, . . . , Oi+i9,bj,bj+i,. . . ,bj + i 9 ) 

10. end for end for 

11. return C'^^] 
end Function 



Here 
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Note that in the naive algorithm large_matrix_mult(A, B), during the main 
loop (lines 7-10), for each twenty rows read from A, the entire matrix B is read. 
During the whole procedure, matrix A will be read entirely exactly once, while 
matrix B will be read k times. Methods improving on this number are described 
in section 6. 

Since for almost all practical cases the size k of matrices A, B € l^ xk will be 
greater than the parameter n, the vectors taken from these matrices will need 
to be iteratively shifted onto the input of the multiplier M.20x20> n elements 
at a time. Therefore, an efficient way to both store and then use the vectors 
taken from the matrices is the creation of FIFO type containers made of shift 
registers. 

Let be a shift register of width n and depth d. It means that can store 
at most d vectors of length n, or equivalently a single vector of length at most 
nd. We choose d such that nd > k, thus it can store one row or column from 
the input matrices A or B. Let the vector filling be f = (fo, fi, . . . , fd-i ), 
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where fi € ZJ, t = 0, 1,...,d — 1. In practice, t£ is a queue data structure. 
In a single step, outputs a vector of length n and shifts its content by n 
places. For the i th activation, the container will output ft. After d activations, 
the container becomes empty. 

One container is used to store a single row or column of matrices A 
or B respectively. Connecting twenty of them in parallel, denoted by Tjj^ = 
(t£ [0] , t£ [1 ] , . . . , t£ [1 9] ) , we obtain a container that stores twenty rows or column- 
s. This is exactly the amount of data the M20X20 multiplier structure requires 
as input in d iteration steps. After d activations Tjj^ has shifted all its stored 
data to M20X20! broken up into pieces of length n for each activation. Two 
such T2Q n containers are connected to M20x20> one for the rows taken from 
matrix A and one for the columns taken from matrix B. 



fi_ 



rpd 

1 20n 










Figure 


4: The structure of 





Using M.20x20 an d in a proper structure, we can execute one iteration 
cycle of the computation. After filling one container with the desired 
twenty rows from matrix A and one container with the desired twenty 
columns from matrix B, we simply send d activation signals to the containers. 
This will shift the data onto M20X2O) which computes the 20 x 20 product 
matrix in the way described in function iterated_m(u, v). The number of 
steps in one iteration cycle is d. 

5 Experimental determination of parameters 

Now, we turn to the determination of n (how many MA modules should be 
connected into a single multiplier m) . This sets the length of the vectors that 
we use in the computation in a single step and thus has an effect on many other 
technical parameters of the design. The goal was to find the greatest number 
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such that the multiplier would still reliably produce the correct dot product 
in a single clock cycle. Clearly, this number dependents on the used hardware 
and the clock frequency. For the device used, the chosen clock frequency was 
100 MHz, the default frequency provided by the board. 
The following experiment was devised to determine the value of n: 
Let S be a multiplier m, called the "Subject" , and let Eo, Ei , . . . , E9 be ten 
more m multipliers, called the "Examiners". Informally, the Examiners' duty 
was to verify the answers given by the Subject to questions they already knew 
the answer to. The "questions" here are test data: two vectors v, u of length n 
generated by the following sequence to obtain suitable pseudo-random values: 

D t = Di_! + D t _ 2 + 2Di_4 + Dt_ 5 , 

where D = Dt = D 2 = D 3 = 0, D 4 = 1 . 

More formally, let p S 7L\§ be a counter that cycles between values 0, 1 , . . . , 9, 
incrementing its value by one with each clock cycle, and returning to value 
after 9. For each clock cycle during the experiment, the following happens 
depending on the value of p: 

• The output of S is checked for equality with the output of Ep . If inequality 
is detected, then an error is noted. 

• The test data Ep + i is currently working on is given to S. 

• New test data is given to Ep_i . 

Procedure testing 

1. Let S, Eo, Ei , . . . , E9 be m multipliers 

2. Let D be the test data generator 

3. Let i £ N,p € Z 10 

4. forever do 

5. 1 = 1+1 

6. p = i mod 10 

7. if S out 7^ Ep_ out then return ERROR 

8. E p _ Un <- D(i) 

9. Si n ^ Ep_|_i_i n 

10. end forever 
end Procedure 

Note that the output of S is checked every clock cycle, which yields that 
S has only a single cycle to calculate its answer to the question it was given 
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Figure 5: Activity of testing module when counter's value is p 



in the preceding clock cycle. A given Examiner, however, has ten times more 
time to work on its test data. Once in every ten clock cycles, new data is given 
to the Examiner to work on, and its output is only checked nine clock cycles 
later, just before it is given new input again. This way the Examiners have 
enough time to compute the correct answer to the question by the time it is 
needed. 

As the initial value for n, we have chosen 16, a number small enough to 
be reasonably expected to pass the criteria set for n, but large enough to 
be of interest. If the experiment reported no error, meaning the Subject was 
flawlessly able to calculate the dot product for a sufficiently long time, then 
the value of u was increased and the experiment repeated. After the first 
error was encountered, meaning the Subject was not able to keep up with the 
calculations, the largest value was chosen for n for which there were no errors. 

On the used device, the largest such value was found to be 28 at a clock 
speed of 100 MHz and setting the length of m multipliers to u = 28 were able 
to work error-free for days without interruption. 
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6 Computation of large matrices 

In the Section 4 we gave an algorithm for using the described modules for 
computing the product of large matrices. Following the description, the im- 
plemented design would make use of the parallelism offered by the FPGA 
only in the computation of dot products. Making further use of parallel oper- 
ations, the design's performance can be significantly improved. In this section 
we describe the implementation choices made to raise the overall performance. 

The biggest factor to consider is the management of data. When comput- 
ing the product of large matrices, the amount of data to store and to move 
between the computation modules can easily exceed the size which can be 
practically stored on the FPGA. Fortunately, as mentioned before, a 256MB 
DDR2 SODIMM is connected to the board as the main data storage device. 
A module is generated using the Memory Interface Generator v3.5 intellectual 
property core provided by Xilinx to implement the logic needed to communi- 
cate with the DDR2 RAM. The module is structured hierarchically, connecting 
the memory device to a user interface. All communication with the device is 
done through two FIFO queues: one queue to send the command and address 
signals, while the other queue is used for write data and write data mask (when 
masking is allowed). 

A naive utilization of the memory would be to simply read the required data 
before each iteration of the computation, and writing the output back after 
it is finished. An undesirable effect of this approach would be that the design 
would spend significantly more time with memory management than with the 
actual computation. The desired result would be that memory management 
(and all other auxiliary operations) were done during the time interval of the 
computation. Note that since both the size of the matrices and the multiplier 
module is fixed, the time the multiplication consumes is a fixed constant, which 
cannot be lowered. Optimally, the time of the computation should be an upper 
bound for the running time of the entire design. The difficulty of reaching this 
optimum lies in the high speed of the multiplier modules compared to the 
memory module. 

One way to resolve the problem caused by slow transmission speed is to 
increase the amount of data stored on the FPGA. Informally, the main idea is 
to keep enough data in a prepared state, i.e. by the time the multiplier module 
finishes all of its computations, we have enough new data to continue working. 
More formally, let us define the following quantities: 

• Let d be the time necessary to complete one iteration of the computation. 
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As described in the previous sections, this is equal to the depth of the 
containers T^ 0n . 

• Let k = where k is the size of the matrices. (A, B € Z kxk ) This 
quantity is already used in algorithm large_matrix_mult. For the rest 
of the section, it is practical to think of A and B as k x k sized block 
matrices, where each element is a 20 x 20 matrix. 

• Let f(A, B) be an arbitrary algorithm executing matrix multiplication 
on A and B, including the memory management needed for the com- 
putation. Let K(f) be the number of times the algorithm needs to fill a 
~^20n container, i.e. the number of times it has to read twenty rows or 
columns from the matrices. Note that completely reading either input 
matrices once means filling T^ n containers k times, since one can 
store twenty rows or columns at a time. Algorithm large_matrix_mult's 
main loop (starting at line 7) reads twenty rows from matrix A (filling a 
Tzon once ) an d reads matrix B entirely for each step. Since the loop has 
k steps, it follows that K(targe_matrlx_rn.utt) = k 2 + k. 

• Let 5 be the time it takes to fill a T^ n container. This quantity depends 
on both the width and depth of the container. The total time f(A, B) 
spends on reading from memory to fill the containers is K(f)6. 

• Let O(f) be the total time the design has to spend with memory man- 
agement. This is the sum of the time it spends on reading matrices A 
and B from the memory and the time it spends on writing the product 
matrix C into the memory. The number of times f has to read A and 
B from the memory depends on f . Note that since the size of the total 
output matrix C is the same as the size of A and B, writing C into the 
memory takes time equal to reading either matrices once from the mem- 
ory. In other words, it takes k6 time. The total time the design has to 
spend with memory management is ®(f) = K(f)5 + k5. 

• Let r(f) be the time f(A, B) spends on the computation itself. From the 
definition of d and k it follows that C(large_matrix_mult) = dK . 

The goal here is to reduce K(f) in such a way that the data required for the 
next iteration of the computation is always ready by the time the previous 
iteration ends. If this arrangement is achieved then C(f) becomes the upper 
bound for the running time of the design. 
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Storing more data on the FPGA can be done by adding more containers 
to the design. During an iteration only two such containers are used directly. 
The rest can be used to load data necessary for the forthcoming iteration steps. 

Suppose the design has z + 2 pieces of T^ n containers. We assign z of the 
containers to store rows from matrix A, called "row-stores", and two of them to 
store columns from matrix B, called "column-stores". With this arrangement, 
we can carry out z — 1 iterations of the computation, using up the data stored 
in z — 1 row-stores and one column-store. This leaves one row-store and one 
column-store to load new data into during the computation. Using the above 
definitions, the allover computation takes (z— l)d time. 
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Figure 6: Configuration of data stored on the FPGA 

If we use all z row-stores and one column-store for the computation while 
the remaining column-store is devoted to loading new columns into, then we 
would have to load all z row-stores with new rows once we read all the columns 
before we can continue the computation. This would take z6 time for each case 
where we read all the columns but haven't read all the rows yet, which happens 
|_~J times. In total, it would add |_f Jz5 to the running time. 

Instead, the computation of the output matrix moves slightly diagonally. 
See Figure 7. The z — 1 row-stores used in the computations store a total of 
(z — 1 ) • 20 rows. Initially, the row-stores are filled with rows do — > Q( z -i).20-i • 
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New rows are loaded in at a slower pace than columns are. By the time all 
columns are read once, the contents of the row-stores have shifted exactly to 
the next segment of data needed, the next (z — 1 ) -20 rows. After matrix B is 
completely read once, the row-stores are filled with rows Q( z _i)2o ~~ * a 2(z-i)-20-i • 
Reading rows and columns proceeds in this manner until we've completely 
read matrix A once. For this reason, it is practical to choose z such that 
(z — 1 ) | k. All together we read matrix B times and matrix A once. 

During each z— 1 iterations shown in Figure 6, twenty new columns and ^ z ^' 20 
new rows are loaded into the column-store and row-store currently unused by 
the computation. When the unused row-store is filled with twenty new rows, it 
becomes active, to be used in the following iterations. The row-store containing 
the rows with the least index becomes inactive in the computation and starts 
accepting the new rows read. 




Figure 7: Progression of computations through matrix C 



Function tmproved_matrlx_mult(A, B) 

1. Define z, k = A', B', C € Z 4 K ' nXK ' n 

2. for i = to 20k - 1 do 

3. for j = to 20k - 1 do 

4. if i < k and j < k then a[- = else = 

5. if i < k and j < k then b-= = by else b(j = 
6. end for end for 
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7. Fill the row-stores with rows do — > O-( z -i)-20-l 

8. Fill the column-stores with columns bo — bi9 

9. For i = 1 to ^ 

Do in parallel: (perform z — 1 iterations of the computation 
|READ the next 20 columns mod k • 20 
IREAD the next (z ~ 1 K )-2 ° rows mod k • 20 
|WRITE the result of the previous z — 1 iterations 

11. return C'^ ^] 
end Function 



The possible values for the parameters used in this section depend on the 
used hardware. 

The size of the matrices used in the implementation are determined by 
parameters u = 28 and d = 32. The LUTs on the device that comprise the T^ n 
containers can be configured as d = 32 bit deep shift registers. For this reason 
the matrices are of size 896 x 896. Rows with length k = 896 are the largest 
that can be stored in containers that are one LUT deep, making them any 
larger would double the number of LUTs needed for creating a TiL,. Because 
of the limited number of LUTs which can be used for storage purposes, z = 10 
was chosen. This yields that twelve containers are defined in the design. 
Dealing with matrices larger than k = 896 is part of future work. 

For convenience, time quantities are measured in clock cycles at 100MHz, 
the clock speed of the M20X20 multiplier. 

The value of 5 depends on the DDR2 RAM used. The device was used at 
200MHz, and has a 64 bit wide physical data bus. 

From these values we determine the following parameters: 

• k = [^i=45, 

• K(impToved_matrtx_mult) = ^ k + k = y • 45 + 45 = 270, 

• 5 = 140 clock cycles at 100MHz, 

• ® (improved_matrix_mult) = K(tmproved_matrix_mult)5+i<6 = 270- 
140 + 45 • 140 = 44100 clock cycles at 100MHz, 

• r(improved_matrlx_mult) = die 2 = 32 • 45 2 = 64800 clock cycles at 
100MHz. 
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The goal of r(improved_matrix_mult) > ® (improved_matrix_mult) is a- 

chieved, meaning that the running time of the design is equal to the time used 
by the computation. 

The speedup provided by the configuration can be shown by comparing its 
performance to a similar implementation created on a more traditional archi- 
tecture. A highly optimized C++ program was created for a machine using 
an Intel E8400 3GHz Dual Core processor with 2GB RAM. The algorithm is 
strongly specialized for the task, making use of all available options for in- 
creasing performance. It uses 64 bit long variables to perform multiplication 
on 16 pairs of two-digit elements at once in parallel on both processor cores. 

The running time of the multiplication of matrices of the same size is over 
100 ms. The FPGA implementation, as mentioned above, achieves a runtime 
of ~0.6 ms. On average, a speedup factor of 200 is reached using the described 
FPGA design. 



7 Future work 

The future course of research will focus on increasing the size of the used 
matrices. 

As mentioned in the previous section, simply increasing the depth d of the 
~^20n containers would be impractical. Since a single LUT on the device can 
only be configured as a 32 bit deep shift register, setting d > 32 would double 
the number of LUTs needed for a T 2 ^ n , and the design is already using well 
over half of the device's LUTs that can be configured this way (13440 out of 
17280, to be exact). Increasing the size of the matrices this way would require 
the restructuring of both the multiplier module and the algorithm used for 
memory management. 

Instead, the currently implemented module can be used as a basic unit for 
the multiplication of larger matrices. Then the entries of the large matrices 
are 896 x 896 blocks. 

This also allows for further optimization using Strassen's algorithm. Suppose 
we double the matrix sizes, interpreting them as matrices with four blocks. 
Using the classical algorithm, multiplying two 1 792 x 1 792 sized matrices would 
take eight multiplication of the blocks. Using a divide-and-conquer strategy, 
we can exchange one multiplication for a few extra additions. 

-D2 + D 4 + D 5 + D 6 D!+D 2 

D3 + D4 0,-03 + 05-07 



"An 


A, 2 " 




"Bn 


B 12 " 




A 2 i 


A 22 _ 




B21 


B 2 2 
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where 



Di 

D 2 
D 3 
D 4 
D 5 
D 6 
D 7 



Aii(B 12 -B 2 2) 
(An +A 12 )B 22 
(A 2 i +A 22 )Bn 
A 22 (B 21 -Bn) 



(An +A 22 )(Bn + B 22 ) 
(A 12 -A 22 )(B 21 +B 22 ) 
(An -A 21 )(Bn +B 12 ). 



This algorithm, with its 0(n lg 7 ) time complexity, could speed up the design on 
large matrices. We should note however, that the speed of the extra additions 
have to be carefully considered. Since the multiplication is already extremely 
fast, a similar improvement may also be necessary for additions if the overall 
performance upgrade is to remain significant. 
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